Greg Caporaso and Rob Knight
Caporaso Lab, Northern Arizona University
Knight Lab, University of California San Diego
Low-level, a bioinformatics library for data scientists, students, educators, and bioinformatics software developers.
Free, Sloan-funded companion text, An Introduction to Applied Bioinformatics available at readIAB.org.
A stable and well-documented API and command line interface.
Low-level, a bioinformatics library for data scientists, students, educators, and bioinformatics software developers.
Free, Sloan-funded companion text, An Introduction to Applied Bioinformatics available at readIAB.org.
End-user, browser-based environment for microbiome analysis and meta-analysis.
A stable and well-documented API and command line interface.
Low-level, a bioinformatics library for data scientists, students, educators, and bioinformatics software developers.
Free, Sloan-funded companion text, An Introduction to Applied Bioinformatics available at [readIAB.org
scikit-bio is built on top of scipy, numpy, pandas, etc.
See Jai Rideout and Evan Bolyen's SciPy 2015 talk for more detail and cool demos http://bit.ly/skbio-scipy2015
(We'll tweet the link with #microbenet.)
In [ ]:
from skbio import DNA
import numpy as np
import skbio
## Load Greengenes and slice all sequences to the region
## amplified by 515F/806R.
aligned_seqs_fp = 'data/gg_13_8_otus/rep_set_aligned/82_otus.fasta'
taxonomy_fp = 'data/gg_13_8_otus/taxonomy/82_otu_taxonomy.txt'
fwd_primer = DNA("GTGCCAGCMGCCGCGGTAA",
metadata={'label':'fwd-primer'})
rev_primer = DNA("GGACTACHVGGGTWTCTAAT",
metadata={'label':'rev-primer'}).reverse_complement()
def seq_to_regex(seq):
result = []
for base in str(seq):
if base in DNA.degenerate_chars:
result.append('[{0}]'.format(
''.join(DNA.degenerate_map[base])))
else:
result.append(base)
return ''.join(result)
regex = '({0}.*{1})'.format(seq_to_regex(fwd_primer),
seq_to_regex(rev_primer))
starts = []
stops = []
for seq in skbio.io.read(aligned_seqs_fp, format='fasta',
constructor=DNA):
for match in seq.find_with_regex(regex, ignore=seq.gaps()):
starts.append(match.start)
stops.append(match.stop)
locus = slice(int(np.median(starts)), int(np.median(stops)))
## Get all kmers for all sequences.
kmer_counts = []
seq_ids = []
for seq in skbio.io.read(aligned_seqs_fp, format='fasta',
constructor=DNA):
seq_ids.append(seq.metadata['id'])
sliced_seq = seq[locus].degap()
kmer_counts.append(sliced_seq.kmer_frequencies(8))
## Load the training/test data, perform feature selection, and train
## and test the classifier using cross-validation.
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectPercentile
from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split
X = DictVectorizer().fit_transform(kmer_counts)
taxonomy_level = 3 # class
id_to_taxon = {}
with open(taxonomy_fp) as f:
for line in f:
id_, taxon = line.strip().split('\t')
id_to_taxon[id_] = '; '.join(taxon.split('; ')[:taxonomy_level])
y = [id_to_taxon[seq_id] for seq_id in seq_ids]
X = SelectPercentile().fit_transform(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y,
random_state=0)
y_pred = SVC(C=10, kernel='linear', degree=3,
gamma=0.001).fit(X_train, y_train).predict(X_test)
## Define a function to use for plotting.
%matplotlib inline
import matplotlib.pyplot as plt
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
plt.figure()
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
plt.ylabel('Known taxonomy')
plt.xlabel('Predicted taxonomy')
plt.tight_layout()
plt.show()
Self-documenting analyses simplify bioinformatics methods reporting, and therefore reproducibility and transparancy.